Analysis of variance table for multiple linear regression, 1 of 3
SSR or \(SS_{regression}=\Sigma(\hat Y_i-\bar Y)^2\)
SSE or \(SS_{error}\) or \(SS_{residual}=\Sigma(Y_i-\hat Y_i)^2=\Sigma\ e_i^2\)
SST or \(SS_{total}=\Sigma(Y_i-\bar Y)^2\)
Analysis of variance table for multiple linear regression, 2 of 3
\(df_{regression}=k\)
\(df_{error}=n-k-1\)
\(df_{total}=n-1\)
\(MS=SS/df\)
Analysis of variance table for multiple linear regression, 3 of 3
\(F=MSR/MSE\)
This tests the hypotheses
\(H_0:\ \beta_1=\beta_2=\ldots=\beta_k=0\)
\(H_1:\ \beta_j \ne 0\) for at least one j
Fail to reject \(H_0\) if F is close to 1
Reject \(H_0\) if F is much larger than 1
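The ANOVA decomposition and F test above can be sketched numerically. This is a minimal illustration with a made-up dataset (the X and y values are hypothetical, not the fat data), using numpy's least-squares fit:

```python
import numpy as np

# Hypothetical data: n = 8 observations, k = 2 independent variables
X = np.array([[1.0, 2.0], [2.0, 1.0], [3.0, 4.0], [4.0, 3.0],
              [5.0, 6.0], [6.0, 5.0], [7.0, 8.0], [8.0, 7.0]])
y = np.array([3.1, 3.9, 7.2, 7.8, 11.1, 11.9, 15.2, 15.8])

n, k = X.shape
Xd = np.column_stack([np.ones(n), X])      # add intercept column
beta, *_ = np.linalg.lstsq(Xd, y, rcond=None)
yhat = Xd @ beta                           # fitted values
ybar = y.mean()

SSR = np.sum((yhat - ybar) ** 2)           # regression sum of squares
SSE = np.sum((y - yhat) ** 2)              # error (residual) sum of squares
SST = np.sum((y - ybar) ** 2)              # total sum of squares

MSR = SSR / k                              # df_regression = k
MSE = SSE / (n - k - 1)                    # df_error = n - k - 1
F = MSR / MSE                              # overall F statistic
print(f"SSR + SSE = {SSR + SSE:.3f}, SST = {SST:.3f}, F = {F:.1f}")
```

With an intercept in the model, SSR + SSE = SST exactly, and a large F leads to rejecting \(H_0\).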
Example using fat data
R-squared
\(R^2=SSR/SST\) or \(1-SSE/SST\)
Proportion of explained variation
Example using the fat data
\(R^2\) = 10,548.480/15,079.017 = 0.70
Adjusted \(R^2\)
\(1-\frac{MSE}{MST}\) or
\(1-\frac{SSE}{SST}\cdot\frac{(n-1)}{(n-k-1)}\)
Field textbook suggests a more complex formula
Penalizes for model complexity (but not enough)
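The fat-data numbers from the earlier slide plug in directly. Note that n = 252 and k = 3 are assumptions here for illustration (the residual-plot slides later use three circumference predictors):

```python
# Sums of squares from the fat-data ANOVA example on the earlier slide
SSR = 10548.480
SST = 15079.017
SSE = SST - SSR

R2 = SSR / SST                             # proportion of explained variation
n, k = 252, 3                              # assumed sample size and predictor count
adj_R2 = 1 - (SSE / SST) * (n - 1) / (n - k - 1)
print(f"R^2 = {R2:.2f}, adjusted R^2 = {adj_R2:.2f}")
```

Adjusted \(R^2\) is always at most \(R^2\); the gap grows as k grows relative to n.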
Live demo, ANOVA and R-squared
Break #2
What you have learned
Analysis of variance table
What’s coming next
Variable selection
Avoid needlessly complex regression models
“Everything should be made as simple as possible, but not simpler” Albert Einstein (?)
“If you have two competing ideas to explain the same phenomenon, you should prefer the simpler one.” Occam’s Razor
“The parsimony principle for a statistical model states that: a simpler model with fewer parameters is favored over more complex models with more parameters, provided the models fit the data similarly well.” - ClubVITA
“Ginny!” said Mr. Weasley, flabbergasted. “Haven’t I taught you anything? What have I always told you? Never trust anything that can think for itself if you can’t see where it keeps its brain?” from Harry Potter and the Chamber of Secrets, J.K. Rowling.
Use a mixture
Use a mixture of science/medicine with automated approaches?
Story about an industrial process
Live demo, stepwise regression
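As a sketch of what the automated part of a stepwise demo does, here is a minimal forward-selection loop: it adds whichever remaining variable gives the largest adjusted \(R^2\) and stops when no addition improves it. The data are simulated (column 1 is pure noise), not the fat data, and real demos would use dedicated software:

```python
import numpy as np

def adj_r2(X, y):
    """Adjusted R^2 for an OLS fit of y on the columns of X (with intercept)."""
    n, k = X.shape
    Xd = np.column_stack([np.ones(n), X])
    beta, *_ = np.linalg.lstsq(Xd, y, rcond=None)
    resid = y - Xd @ beta
    sse = resid @ resid
    sst = np.sum((y - y.mean()) ** 2)
    return 1 - (sse / sst) * (n - 1) / (n - k - 1)

def forward_select(X, y):
    """Greedy forward selection on adjusted R^2."""
    remaining = list(range(X.shape[1]))
    chosen, best = [], -np.inf
    while remaining:
        score, j = max((adj_r2(X[:, chosen + [j]], y), j) for j in remaining)
        if score <= best:                  # no improvement: stop
            break
        best = score
        chosen.append(j)
        remaining.remove(j)
    return chosen

rng = np.random.default_rng(1)
X = rng.normal(size=(60, 3))               # column 1 is unrelated noise
y = 2 * X[:, 0] + 3 * X[:, 2] + 0.5 * rng.normal(size=60)
chosen = forward_select(X, y)
print(chosen)                              # should recover columns 0 and 2
```

Backward elimination and bidirectional stepwise procedures follow the same pattern with the add/drop step reversed or combined.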
Break #4
What you have learned
Stepwise regression
What’s coming next
Residual analysis
Using residuals to check for assumption violations
Non-normality
QQ plot
Lack of independence
Time sequence plot, Durbin-Watson statistic
Only for time-ordered data
Unequal variances
Scatterplot of residuals versus predicted values
Non-linearity
Scatterplot of residuals versus each independent variable
Scatterplot of residuals versus predicted values
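For the independence check, the Durbin-Watson statistic is simple enough to compute by hand: the sum of squared successive residual differences over the sum of squared residuals. Values near 2 suggest no first-order autocorrelation; values near 0 or 4 suggest positive or negative autocorrelation. A sketch with made-up, time-ordered residuals:

```python
def durbin_watson(e):
    """DW = sum of squared successive differences / sum of squared residuals."""
    num = sum((e[i] - e[i - 1]) ** 2 for i in range(1, len(e)))
    den = sum(r ** 2 for r in e)
    return num / den

resid = [0.5, -0.3, 0.4, -0.6, 0.2, -0.1, 0.3, -0.4]  # hypothetical residuals
dw = durbin_watson(resid)
print(f"DW = {dw:.2f}")  # well above 2: alternating signs hint at negative autocorrelation
```

Again, this statistic is meaningful only when the observations have a natural time order.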
Q-Q plot of residuals
Scatterplot of residuals and chest circumference
Scatterplot of residuals and abdomen circumference
Scatterplot of residuals and hip circumference
Scatterplot of residuals and predicted values
Live demo, residual plots
Break #5
What you have learned
Residual analysis
What’s coming next
Collinearity
What is collinearity?
Strong interrelationship among the independent variables
Also known as
multi-collinearity
near collinearity
ill-conditioning
Interrelationship could be just two variables
Also could be three or more interrelated variables
Examples of collinearity
Birthweight and gestational age predicting length of stay
Size of the home and size of the lot predicting sales price
Calories from fat, from protein, and from carbohydrates predicting weight gain
What problems are caused by collinearity?
Difficulty in variable selection
Loss of precision
wider confidence intervals
Loss of power
Need for larger sample sizes
Not a violation of assumptions
Not a problem if you are only interested in prediction
Fixing collinearity
Collect more data
Oversample “rare” corners
Prune your variables
Measures of collinearity
Correlation matrix
Tolerance
\(Tol_i=1-R_i^2\)
\(R_i^2\) is the \(R^2\) from regressing the \(i^{th}\) independent variable on the remaining independent variables
Variance inflation factor
\(VIF_i=\frac{1}{Tol_i}\)
Factor by which \(Var(\hat\beta_i)\) is inflated due to collinearity
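Tolerance and VIF can be computed directly from the definitions above. A sketch with simulated data in which x2 is nearly a copy of x1 (so both get large VIFs) while x3 is unrelated:

```python
import numpy as np

def r_squared(X, y):
    """R^2 from regressing y on the columns of X (with intercept)."""
    Xd = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(Xd, y, rcond=None)
    resid = y - Xd @ beta
    return 1 - (resid @ resid) / np.sum((y - y.mean()) ** 2)

rng = np.random.default_rng(0)
x1 = rng.normal(size=50)
x2 = x1 + 0.1 * rng.normal(size=50)        # nearly collinear with x1
x3 = rng.normal(size=50)                   # unrelated to the others
X = np.column_stack([x1, x2, x3])

vifs = []
for i in range(X.shape[1]):
    others = np.delete(X, i, axis=1)       # remaining independent variables
    tol = 1 - r_squared(others, X[:, i])   # Tol_i = 1 - R_i^2
    vifs.append(1 / tol)                   # VIF_i = 1 / Tol_i
    print(f"x{i + 1}: tolerance = {tol:.3f}, VIF = {vifs[-1]:.1f}")
```

A common rule of thumb flags VIF above 10 (tolerance below 0.1) as serious collinearity; here x1 and x2 are flagged and x3 is not.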
Collinearity statistics for the fat dataset
What is perfect collinearity?
Exact relationship among independent variables
Impossible to estimate regression coefficients
Examples
Measuring temperature in both Fahrenheit and Centigrade
Three percentages adding up to exactly 100%
Only solution: drop one or more variables
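The Fahrenheit/Centigrade example can be verified numerically: with an exact linear dependence among the columns, \(X'X\) is singular, so the normal equations have no unique solution and the coefficients cannot be estimated:

```python
import numpy as np

c = np.array([0.0, 10.0, 20.0, 30.0])      # temperature in Centigrade
f = 32 + 9 * c / 5                         # Fahrenheit: exact function of c
X = np.column_stack([np.ones(4), c, f])    # design matrix with intercept
XtX = X.T @ X
rank = np.linalg.matrix_rank(XtX)
print(rank)                                # 2, not 3: X'X is singular
```

Dropping either temperature column restores full rank, which is why removing a variable is the only fix.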
Live demo, multicollinearity
Break #6
What you have learned
Collinearity
What’s coming next
Mediation
What is mediation?
“A situation when the relationship between a predictor variable and an outcome variable can be explained by their relationship to a third variable (the mediator)”